Abstract


The Naval Space Surveillance Center (NAVSPASUR) uses an analytic satellite motion
model based on the Brouwer-Lyddane theory to track objects orbiting the Earth. In
this paper we develop several parallel algorithms based on this model. These have been
implemented on the Intel iPSC/2 hypercube multicomputer. The speed-up and efficiency
of these algorithms are reported. We show that the best of these algorithms achieves
87% efficiency on a 16-node hypercube.


Introduction 

The  Naval  Space  Surveillance  Center  (NAVSPASUR)  uses  an  analytic  satellite  motion  model 
to  track  objects  orbiting  the  Earth.  This  model  is  implemented  in  the  Fortran  subroutine 
PPT2. This subroutine predicts an artificial satellite’s position and velocity vectors at a
selected  time  to  aid  in  the  tracking  endeavor.  Several  calls  to  the  subroutine  may  be  required 
to  aid  in  the  identification  of  one  object.  A  substantial  increase  in  the  number  of  objects  or  a 
desire  to  increase  the  accuracy  of  the  model  will  require  a  similar  increase  in  computer  time. 
Parallel  computing  offers  one  option  to  decrease  the  computation  time  without  sacrificing 
accuracy. 

For a multicomputer, the user must partition the problem among the processors. Two
decompositions are possible and are discussed here: control decomposition and domain
decomposition.

In  this  paper  we  determine  the  parallel  computing  potential  of  the  current  NAVSPASUR 
model  as  applied  to  a  MIMD  computer  and  simulated  on  an  iPSC/2  hypercube.  In  the 
next  section,  we  develop  a  control  decomposition  method  and  discuss  the  speed-up  attained 
by  our  numerical  experiments  on  a  4-node  hypercube.  In  section  3,  we  discuss  a  domain 
decomposition  method.  We  show  that  domain  decomposition  yields  higher  speed-up.  We 
also  develop  a  model  showing  that  16  nodes  yield  optimal  efficiency  (almost  90%)  and  discuss 
how  to  utilize  larger  dimension  hypercubes  without  losing  efficiency. 


Control  Decomposition 


Control decomposition is the strategy of dividing tasks among the nodes. This is
recommended for problems with irregular data structures or unpredictable control flows
(see Parallel Programming Primer, pp. 4-6, in [5]). The exact tasks required of each node
are explicitly stated in the parallel program.

In  order  to  predict  a  satellite’s  state  vector  considering  the  secular  and  periodic  correction 
terms  due  to  the  zonal  harmonics  and  a  correction  term  for  each  element  due  to  the  sectoral 
harmonics,  the  NAVSPASUR  model  requires  the  completion  of  55  major  tasks.  These  tasks 
are described by Phipps (1992). The first step in partitioning these tasks among the nodes
was to determine which tasks could be completed concurrently. Concurrency was determined
by developing a hierarchy of the formulas used by the NAVSPASUR model. Each individual
task was listed with its required input. Tasks that could be executed concurrently were
listed on the same row of Table 1.

Table 1: Concurrent NAVSPASUR Orbit Model Tasks

Level  Quantities                             Equations
1      T2, [illegible]                        (2.47-8), (2.50)
2      [illegible], a, e, Δℓ                  (2.54), (2.41), (2.45), (2.38)
3      dg/dℓ, dh/dℓ, ℓ″, a″                   (2.51), (2.52), (2.55), (2.57)
4      g″, h″, [illegible], cos I, VLL2       (2.55), (2.55), (2.58), (2.59)
5      δ₁e, e δ₁ℓ, r″, [illegible], sin I     (2.59), (2.59), (2.58), (2.58)
6      [illegible], e δℓ, sin(I/2) δh, δ₂z    (2.59), (2.64), (2.66), (2.67)
7      δI, e, ℓ, z                            (2.64), (2.70), (2.70), (2.68)
8      cos I, h, cos [illegible]              (2.70), (2.70), (2.72)
9      g, r                                   (2.71), (2.72)
10     f, V                                   (2.37), (2.37)
11     r, V                                   (2.37), (2.37)

Additional column entries: (2.57); VLE1 (2.59); VLE2 (2.59); VLE3 (2.59); VLH1I (2.59);
sin i″ δ₁h (2.59); z (2.67); δ₂e (2.61); e δ₂ℓ (2.63); sin i″ δ₂h (2.65); δe (2.62);
a (2.69); VLH2I (2.59); VLH3I (2.59); VLS1 (2.60); VLS2 (2.60); VLS3 (2.60); δ₂a;
sectoral (2.18) (LUNAR).

Remark:  The  equation  numbers  in  the  table  refer  to  Phipps  (1992)  . 
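
The row assignment described above amounts to giving each task a level one greater than the deepest level among its inputs. The sketch below illustrates that rule on a hypothetical five-task graph; the task names and dependencies are invented for illustration and are not the NAVSPASUR tasks, and the function names are assumptions.

```python
def concurrency_levels(deps):
    """Map each task to its level: 1 + the deepest level among its inputs."""
    levels = {}

    def level_of(task, visiting=()):
        if task in levels:
            return levels[task]
        if task in visiting:
            raise ValueError(f"cyclic dependency at {task!r}")
        inputs = deps.get(task, ())
        lvl = 1 + max((level_of(d, visiting + (task,)) for d in inputs), default=0)
        levels[task] = lvl
        return lvl

    for task in deps:
        level_of(task)
    return levels


def rows(levels):
    """Group tasks by level, one row per level as in a concurrency table."""
    table = {}
    for task, lvl in levels.items():
        table.setdefault(lvl, []).append(task)
    return {lvl: sorted(tasks) for lvl, tasks in sorted(table.items())}


# Hypothetical toy graph: task -> tasks whose results it needs.
toy_deps = {
    "A": [],
    "B": [],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C", "D"],
}
print(rows(concurrency_levels(toy_deps)))
```

Tasks sharing a row have no dependency on one another, so in principle they may run on different nodes at the same time.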

From this table, one can see that the number of tasks that could be computed concurrently
at each level ranges from 2 to 14. Additionally, the computational requirements vary
considerably among the tasks. For example, the computational requirement for the solution
of Kepler’s equation by Steffensen’s method depends on the number of iterations necessary
to achieve convergence. This variance in the number of operations required by the various
tasks presented a potential problem in load balancing: bottlenecks arise when nodes must
wait for the results of computations performed by other processors. Phipps (1992) showed
that a manager-worker algorithm (to achieve load balancing) increases the communication
and thus decreases efficiency. The tasks are therefore pre-scheduled. The optimal number
of nodes was found to be four. In Table 2, we list the tasks scheduled for each node. A
computer program, P3T-4, was developed for the hypercube. Experiments with this program
show that the computation time (tc) for P3T-4 is about half that of PPT2. Unfortunately,
the communication time (tm) was so high that the total time for P3T-4 was larger (see
Table 3).
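
To illustrate why the cost of the Kepler task depends on the input data, the following sketch solves Kepler's equation E - e sin E = M with Steffensen's method (fixed-point iteration with Aitken acceleration) and reports the iteration count. It is a generic illustration, not the PPT2 code; the function name, the starting guess E = M, and the tolerance are assumptions.

```python
import math


def solve_kepler_steffensen(mean_anomaly, ecc, tol=1e-12, max_iter=50):
    """Return (E, iterations) with E - ecc*sin(E) = mean_anomaly."""
    E = mean_anomaly  # assumed starting guess
    for iteration in range(1, max_iter + 1):
        # Two plain fixed-point steps of g(E) = M + e*sin(E).
        E1 = mean_anomaly + ecc * math.sin(E)
        E2 = mean_anomaly + ecc * math.sin(E1)
        denom = E2 - 2.0 * E1 + E
        if denom == 0.0:
            # The plain steps no longer move: already converged.
            return E2, iteration
        # Aitken delta-squared update (the Steffensen step).
        E_new = E - (E1 - E) ** 2 / denom
        if abs(E_new - E) < tol:
            return E_new, iteration
        E = E_new
    return E, max_iter


for ecc in (0.01, 0.3, 0.9):
    E, its = solve_kepler_steffensen(1.0, ecc)
    print(ecc, its, E - ecc * math.sin(E) - 1.0)
```

The printed iteration count differs between eccentricities, which is the source of the variable flop count for this task.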

One  method  to  reduce  the  ratio  of  communication  to  computation  is  by  computing  the 
path  of  n  satellites  at  the  same  time.  In  other  words,  currently  the  program  PPT2  reads 
the  initial  values  of  one  satellite  and  computes  its  position  at  a  given  time,  and  then  moves 
on  to  the  computation  of  the  next  satellite  position.  Since  each  communication  requires 
an  overhead,  it  is  cheaper  to  send  a  long  message.  To  arrange  that,  we  suggest  that  the 
program  reads  initial  values  of  several  satellites  and  computes  the  paths  concurrently.  This 
will require the same number of messages, but each one is n times longer. The efficiency,

Table 2: Tasks for each node

Entries in each row are listed in column order (Node 0, Node 1, Node 2, Node 3).

Row 1 (two entries):
  Recover a″: 137 flops, send 8 bytes
  Compute T2: 113 flops, send 8 bytes
Row 2:
  Node 0: Compute secular corrections - ℓ, a, and e: 45 flops, send 24 bytes
  Node 1: Compute secular correction - g: 80 flops, send 8 bytes
  Node 2: Compute long period correction - z: 113 flops
  Node 3: Compute secular correction - h: 85 flops, send 8 bytes
Row 3:
  Node 0: Compute long period correction - ℓ: 63 flops
  Node 1: Compute long period corrections - e and I: 64 flops
  Node 2: Solve Kepler’s equation: ~308 flops
  Node 3: Compute sectoral terms: 528 flops, send 48 bytes
Row 4:
  Node 0: Compute short period correction - ℓ: 46 flops, send 8 bytes
  Node 1: Compute short period corrections - e and I: 88 flops, send 16 bytes
  Node 2: Compute short period correction - z: 14 flops, send 8 bytes
  Node 3: Compute long period correction - h: 69 flops
Row 5 (two entries):
  Compute short period correction - a: 24 flops
  Compute short period correction - h: 52 flops, send 24 bytes
Row 6:
  Solve Kepler’s equation: ~308 flops
  Collect all terms; compute state vector: 74 flops


Table 3: P3T-4 Execution Time Breakdown

Algorithm          tc (milliseconds)   tm (milliseconds)   total (milliseconds)
PPT2 (one node)    11.2                NA                  11.2
P3T-4, node 0      4.3                 19.0                23.3
P3T-4, node 1      2.2                 15.9                18.1
P3T-4, node 2      2.7                 14.7                17.4
P3T-4, node 3      5.8                 15.7                21.5

is given by

$$E_p = \frac{n\, t_1}{p\,(n\, t_c + t_m)},$$

since the communication time is not affected by the length of the message. As one increases
the number n, the limit is

$$\lim_{n \to \infty} E_p = \frac{t_1}{p\, t_c}.$$


Using the values in the above table one finds that the efficiency is bounded by 0.49. This
is the best we were able to achieve. The reason is that the computation time for one
satellite is 5.8 msec on 4 nodes and 11.2 msec on 1 node. Thus the maximum achievable
efficiency is bounded by 0.5. Since this is not high enough, we have tried domain
decomposition. This is discussed in the next section.
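
The batching argument can be checked numerically. The sketch below evaluates the efficiency n t1 / (p (n tc + tm)) for growing n, together with its limit t1 / (p tc). The timings are read from Table 3; taking tc and tm from node 3 (the node with the largest computation time) is an assumption made here, as are the function names.

```python
def batched_efficiency(n, t1, tc, tm, p):
    """Efficiency when n satellites share one round of messages.

    The message time tm is assumed independent of message length,
    as stated in the text.
    """
    return n * t1 / (p * (n * tc + tm))


def efficiency_limit(t1, tc, p):
    """Limit of batched_efficiency as n grows without bound."""
    return t1 / (p * tc)


# Timings in milliseconds from Table 3; the choice of node 3 is an assumption.
T1 = 11.2   # PPT2 on one node
TC = 5.8    # computation time on node 3
TM = 15.7   # communication time on node 3
P = 4       # nodes used by P3T-4

for n in (1, 10, 100, 1000):
    print(n, round(batched_efficiency(n, T1, TC, TM, P), 3))
print("limit", round(efficiency_limit(T1, TC, P), 3))
```

With these inputs the efficiency rises with n but stays below one half, in line with the bound quoted above.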


Domain  Decomposition 


The  strategy  of  domain  decomposition  is  to  reduce  the  computation  time  by  the  concurrent 
computation  of  several  satellites’  state  vectors.  Each  node  of  the  hypercube  would  complete 
identical  tasks  on  different  satellite  data  sets,  simultaneously. 

Unlike the application of the control decomposition strategy, the application of the
domain decomposition strategy to the NAVSPASUR model was seemingly less arduous. First,
because each node propagates satellite data sets independent of the other nodes, there exists
no requirement for communication or synchronization among the nodes. This lack of
communication simplifies the load balancing and sequential bottleneck problems present in
the P3T-4 parallel algorithm.

Second, because each node may perform the satellite state vector prediction tasks serially,
the existing subroutine PPT2 may be used with only minor modifications. Developing a
parallel algorithm for predicting an individual satellite’s state vector was a major task for the
control decomposition strategy. Additionally, by using the existing PPT2 code, the other
tasks completed by PPT2 may be requested by the user using the same control variables
as used by the original PPT2 subroutine. The P3T-4 program set was restricted to only
predicting a satellite’s state vector.

Figure 1: P3T Algorithm

Finally, by using the serial subroutine PPT2, this strategy may be reduced to only
developing an algorithm to distribute the data in a timely manner. Maximum efficiency will
be achieved if the nodes do not have to wait for satellite data to propagate.

Intuitively, this strategy seems perfectly parallelizable. Although the various tasks
performed by PPT2 require different computation times, the total execution time for each node
will be essentially the same if it is assumed that the various tasks are randomly distributed
throughout the input data sets. The concern for this algorithm was the potential sequential
bottlenecks at the input/output portions of the program set. Reading and writing to external
files can be very time consuming. In addition to the actual time spent reading/writing to an
external file, a certain amount of time is spent to access the file. In order to minimize this
time, the number of calls to read/write to a file should be minimized.

With the specific iPSC/2 hypercube available, input/output is completed sequentially.
Each node must compete with the other nodes to read and write to external files. To
minimize time lost to accessing the file cataloging the set of satellites, one node was devoted
to reading and distributing the input satellite data and another to collecting and writing the
results. The idea of using a single node to read the data and a single node to subsequently
write the output is simple to implement and proved to be the fastest method to overcome the
bottlenecks with the input/output. The remaining nodes of the hypercube implement the
NAVSPASUR model using a slightly modified PPT2. The diagram in Figure 1 depicts how
the satellite data is distributed. The cost of using this simple algorithm to distribute and
collect the data is the loss of two nodes. The only restriction on the size of the hypercube
required by P3T is that the attached cube must contain at least four nodes to achieve any
speedup.
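
The node roles described above can be sketched as a simple assignment of satellite data sets to the p - 2 working nodes. The round-robin order and the node numbering (node 0 distributes, node 1 collects) are assumptions made for illustration; the text does not specify them, and the function name is hypothetical.

```python
def assign_satellites(n_satellites, p):
    """Deal satellite indices round-robin to the p - 2 working nodes.

    Hypothetical numbering: node 0 distributes input, node 1 collects
    output, nodes 2 .. p-1 run the serial propagator.
    """
    if p < 3:
        raise ValueError("need at least one working node besides the two I/O nodes")
    workers = list(range(2, p))
    shares = {node: [] for node in workers}
    for sat in range(n_satellites):
        shares[workers[sat % len(workers)]].append(sat)
    return shares


shares = assign_satellites(12, 8)
for node, sats in shares.items():
    print(node, sats)
```

With p = 4 only two nodes propagate satellites, which is why at least four nodes are needed before any speedup over a single node is possible.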

The graph in Figure 2 depicts the mean execution time for P3T versus the number
of satellites propagated using hypercubes of four and eight nodes respectively. P3T was
successful in reducing the overall execution time to propagate several satellites. Table 4
shows the speedup and efficiency of P3T for various numbers of satellites. As seen in
Table 4, the speedup achieved using all eight nodes of the hypercube was approximately
three times larger than the speedup achieved using four nodes. Because this parallel
algorithm uses six “working” nodes on an eight-processor hypercube and only two “working”
nodes on a four-processor hypercube, an increase in speedup by approximately a factor of
three was expected. In other words, since two processors are tied to input/output and cannot
be used for computation, one should expect the gain to increase until we recover the loss of
those two. More notable was the increase in efficiency using eight versus four nodes. For
1728 satellites, the efficiency increased from .47 to .69. This increase in efficiency indicates
that P3T applied to a hypercube of greater dimension could yield even greater speedup and
efficiency.

Table 4: Speedup and Efficiency Comparison

P3T       # of satellites   Sp      Ep
8 nodes   1728              5.53    .69
          144               5.45    .68
          12                4.82    .60
4 nodes   1728              1.86    .47
          144               1.86    .47
          12                1.82    .46
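
As a quick consistency check on Table 4, the sketch below recomputes the efficiency as speedup divided by the number of nodes and compares the measured speedup ratio with the ratio of working nodes. The speedup values are copied from Table 4; the helper names are assumptions.

```python
# Speedups from Table 4, keyed by node count and then by number of satellites.
SPEEDUP = {
    8: {1728: 5.53, 144: 5.45, 12: 4.82},
    4: {1728: 1.86, 144: 1.86, 12: 1.82},
}


def efficiency(speedup, p):
    """Efficiency is speedup divided by the number of processors."""
    return speedup / p


def working_node_ratio(p_large, p_small, reserved=2):
    """Ratio of working nodes when `reserved` nodes handle input/output."""
    return (p_large - reserved) / (p_small - reserved)


for p, by_count in SPEEDUP.items():
    for count, s in by_count.items():
        print(p, count, round(efficiency(s, p), 3))

print("expected ratio", working_node_ratio(8, 4))
print("measured ratio", round(SPEEDUP[8][1728] / SPEEDUP[4][1728], 2))
```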

Table  4  also  indicates  that  P3T  performance  increased  somewhat  with  an  increase  in  the 
total  number  of  satellites  propagated.  Because  with  this  parallel  algorithm  the  computation 
to  communication  ratio  does  not  vary  with  the  number  of  satellites,  this  small  increase  in 
performance  must  be  primarily  due  to  the  diminishing  impact  of  the  algorithm’s  overhead 
on  total  execution  time.  This  overhead  includes  one  additional  message  containing  the 
total  number  of  satellites  to  propagate  from  the  distributing  node  to  the  other  nodes;  some 
small computations by working nodes to determine the number of data sets to receive; and a
halting  message  sent  by  the  collecting  node  to  the  host  once  all  of  the  nodes  are  finished. 
Because  these  additional  messages  and  computations  are  only  completed  once  in  the  program, 
the  time  cost  associated  with  this  overhead  becomes  negligible  as  the  number  of  satellites 
propagated  is  increased.  The  speedup  and  efficiency  remained  fairly  constant  for  greater 
than  144  satellites. 

The  performance  results  of  this  algorithm  using  only  four  and  eight  nodes  indicated  a 
potential  increase  in  both  speedup  and  efficiency  if  this  algorithm  could  be  applied  to  a 
hypercube  of  greater  dimension.  Because  the  number  of  working  nodes  is  not  fixed  for  this 
algorithm,  P3T  could  be  applied  easily  to  any  size  hypercube  with  no  modifications. 

Figure 2: Theoretical Speedup for Propagating 2044 Satellites

The efficiency of the algorithm should increase with the number of processors until the
time to distribute a separate satellite data set to each working node exceeds the time required
by a node to propagate a single satellite. A model was used to estimate the optimal number
of nodes. The total execution time for P3T to propagate n satellites with p processors, t(p),
can be modelled by

$$t(p) = t_{w1}(p) + t_{w2}(p) + t_c(p),$$

where $t_{w1}(p)$ is the time the last node must wait to receive its first satellite data set,
$t_{w2}(p)$ is the total time the last node must wait to receive all of its subsequent satellite
data sets, and $t_c(p)$ is the computation time for each node to propagate its share of the n
satellites. For this algorithm, there are p - 2 working nodes. Denoting the time to send a
single message between the distributing node and a working node as $t_m(1)$, $t_{w1}(p)$ may be
modeled by

$$t_{w1}(p) = (p - 3)\, t_m(1).$$

For the iPSC/2 it was found that

$$t_m(1) = 0.693 \text{ msec}.$$

The wait time $t_{w2}$ is zero unless the number of working nodes is large enough, i.e.

$$t_{w2}(p) = \begin{cases} 0, & t_{w1}(p) \le t_1, \\ \left(t_{w1}(p) - t_1\right)\left[\dfrac{n}{p-2} - 1\right], & t_{w1}(p) > t_1, \end{cases}$$

where $t_1$ is the computation time to propagate a single satellite (11.2 msec). Note that the
factor $\frac{n}{p-2} - 1$ is the number of subsequent satellite data sets. The computation time
$t_c$ is given by

$$t_c(p) = \frac{n}{p-2}\, t_1.$$

Therefore, the speedup and efficiency are given by

$$S_p = \frac{n\, t_1}{t_{w1}(p) + t_{w2}(p) + t_c(p)},$$

$$E_p = \frac{n\, t_1}{p \left[t_{w1}(p) + t_{w2}(p) + t_c(p)\right]}.$$

Figure 3: Theoretical Efficiency for Propagating 2044 Satellites


Figures  2  and  3  depict  these  theoretical  estimates  of  Sp  and  Ep  for  propagating  2044  satellites 
using  4  to  1024  processors.  Using  the  above  model,  P3T  is  capable  of  achieving  a  maximum 
speedup  of  13.95  and  an  efficiency  of  0.87  using  16  nodes. 
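
The timing model can be evaluated directly. The sketch below assumes the form read from the equations of this section: a first wait of (p - 3) t_m(1), a subsequent wait of (t_w1 - t_1)(n/(p-2) - 1) when t_w1 exceeds t_1 and zero otherwise, and a computation time of (n/(p-2)) t_1. Treating n/(p-2) as a real number and the function names are assumptions.

```python
T_1 = 11.2    # msec to propagate one satellite (from the text)
T_M1 = 0.693  # msec to send one message on the iPSC/2 (from the text)


def execution_time(n, p, t1=T_1, tm1=T_M1):
    """Modelled time t(p) = t_w1 + t_w2 + t_c for n satellites on p nodes."""
    sets_per_worker = n / (p - 2)          # two nodes are reserved for I/O
    t_w1 = (p - 3) * tm1                   # wait for the first data set
    t_w2 = max(0.0, t_w1 - t1) * (sets_per_worker - 1)  # later waits
    t_c = sets_per_worker * t1             # computation on one worker
    return t_w1 + t_w2 + t_c


def speedup(n, p):
    return n * T_1 / execution_time(n, p)


def efficiency(n, p):
    return speedup(n, p) / p


# Hypercube sizes from 4 to 1024 nodes.
sizes = [2 ** d for d in range(2, 11)]
for p in sizes:
    print(p, round(speedup(2044, p), 2), round(efficiency(2044, p), 3))
```

Under these assumptions the efficiency for 2044 satellites peaks at 16 nodes among the hypercube sizes considered, at roughly 0.87, close to the figures quoted above.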


Conclusions 

In  this  paper  we  have  developed  two  ideas,  control  decomposition  and  domain  decomposition, 
to  parallelize  the  NAVSPASUR  satellite  motion  model.  The  control  decomposition  idea 
is  not  efficient  because  the  model  is  not  computationally  intensive  enough.  The  domain 
decomposition can reach an efficiency of 87% when using a 16-node hypercube. There
are many orbit models in use nowadays. Several questions can be raised as a result of
this research. How should an orbit theory be organized to take advantage of MIMD
computers? How should a semianalytic theory be organized for parallel computers? We are
now working on a parallel version for the analytical model SGP4 in use by USSPACECOM
and for the semianalytic satellite model developed at Draper Laboratory.